{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a59bf2a9e5015030",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 微调大型语言模型以生成波斯语产品目录的 JSON 格式\n",
    "\n",
    "_作者：[Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "755fc90c27f1cb99",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "在这个 Notebook 中，我们尝试对大型语言模型进行微调，且没有添加额外复杂性。该模型已针对客户级别的 GPU 进行优化，用于生成波斯语产品目录并以 JSON 格式生成结构化输出。它特别适用于从伊朗平台（如 [Basalam](https://basalam.com)、[Divar](https://divar.ir/)、[Digikala](https://www.digikala.com/) 等）上的用户生成内容的非结构化标题和描述中创建结构化输出。\n",
    "\n",
    "你可以在 [我们的 Hugging Face 账户](https://huggingface.co/BaSalam/Llama2-7b-entity-attr-v1) 查看微调后的 LLM。此外，我们还使用了最快的开源推理引擎之一 —— [Vllm](https://github.com/vllm-project/vllm) 来进行推理。\n",
    "\n",
    "让我们开始吧！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a35eafbe37e4ad2",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from datasets import load_dataset\n",
    "from transformers import (\n",
    "    AutoModelForCausalLM,\n",
    "    AutoTokenizer,\n",
    "    BitsAndBytesConfig,\n",
    "    TrainingArguments,\n",
    ")\n",
    "from peft import LoraConfig, PeftModel\n",
    "from trl import SFTTrainer, DataCollatorForCompletionOnlyLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30caf9936156e430",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`peft` 库，或称为参数高效微调（Parameter Efficient Fine-Tuning），旨在更高效地对大型语言模型（LLMs）进行微调。如果像传统神经网络那样打开并微调网络的上层，它将需要大量的计算和显著的 VRAM（显存）。随着近期论文中提出的方法，`peft` 库已被实现，用于高效地对 LLM 进行微调。你可以在这里了解更多关于 `peft` 的信息：[Hugging Face PEFT](https://huggingface.co/blog/peft)。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "261a8f52fe09202e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 设置超参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96fccf9f7364bac6",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# General parameters\n",
    "model_name = \"NousResearch/Llama-2-7b-chat-hf\"  # The model that you want to train from the Hugging Face hub\n",
    "dataset_name = \"BaSalam/entity-attribute-dataset-GPT-3.5-generated-v1\"  # The instruction dataset to use\n",
    "new_model = \"llama-persian-catalog-generator\"  # The name for fine-tuned LoRA Adaptor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f69a97083bf19d9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# LoRA parameters\n",
    "lora_r = 64\n",
    "lora_alpha = lora_r * 2\n",
    "lora_dropout = 0.1\n",
    "target_modules = [\"q_proj\", \"v_proj\", 'k_proj']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "382296d37668763c",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "LoRA（低秩适配，Low-Rank Adaptation）通过构建并将低秩矩阵添加到每个模型层中来存储权重变化。这种方法仅打开这些层进行微调，而不改变原始模型权重，也不需要长时间训练。生成的权重非常轻量，可以多次产生，从而允许在将 LLM 加载到 RAM 中后微调多个任务。\n",
    "\n",
    "你可以在 [Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) 了解更多关于 LoRA 的信息。对于其他高效的训练方法，请参阅 [Hugging Face 性能训练文档](https://huggingface.co/docs/transformers/perf_train_gpu_one) 和 [SFT Trainer 增强](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "501beb388b6749ea",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# QLoRA parameters\n",
    "load_in_4bit = True\n",
    "bnb_4bit_compute_dtype = \"float16\"\n",
    "bnb_4bit_quant_type = \"nf4\"\n",
    "bnb_4bit_use_double_quant = False\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39149616eb21ec5b",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "QLoRA（量化低秩适配，Quantized Low-Rank Adaptation）是一种高效的微调方法，通过使用 4 位量化，使大型语言模型能够在更小的 GPU 上运行。该方法在减少内存使用的同时，保持了 16 位微调的全部性能，使得在单个 48GB GPU 上也可以微调最多 650 亿参数的模型。QLoRA 结合了 4 位 NormalFloat 数据类型、双重量化和分页优化器，有效管理内存。它允许对具有低秩适配器的模型进行微调，显著提高了 AI 模型开发的可访问性。\n",
    "\n",
    "你可以在 [Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes) 上了解更多关于 QLoRA 的信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83f51e63e67aa87b",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# TrainingArguments parameters\n",
    "num_train_epochs = 1\n",
    "fp16 = False\n",
    "bf16 = False\n",
    "per_device_train_batch_size = 4\n",
    "gradient_accumulation_steps = 1\n",
    "gradient_checkpointing = True\n",
    "learning_rate = 0.00015\n",
    "weight_decay = 0.01\n",
    "optim = \"paged_adamw_32bit\"\n",
    "lr_scheduler_type = \"cosine\"\n",
    "max_steps = -1\n",
    "warmup_ratio = 0.03\n",
    "group_by_length = True\n",
    "save_steps = 0\n",
    "logging_steps = 25\n",
    "\n",
    "# SFT parameters\n",
    "max_seq_length = None\n",
    "packing = False\n",
    "device_map = {\"\": 0}\n",
    "\n",
    "# Dataset parameters\n",
    "use_special_template = True\n",
    "response_template = ' ### Answer:'\n",
    "instruction_prompt_template = '\"### Human:\"'\n",
    "use_llama_like_model = True\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "234ef91c9c1c0789",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 模型训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cc58fe0c4b229e0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load dataset (you can process it here)\n",
    "dataset = load_dataset(dataset_name, split=\"train\")\n",
    "percent_of_train_dataset = 0.95\n",
    "other_columns = [i for i in dataset.column_names if i not in ['instruction', 'output']]\n",
    "dataset = dataset.remove_columns(other_columns)\n",
    "split_dataset = dataset.train_test_split(train_size=int(dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)\n",
    "train_dataset = split_dataset[\"train\"]\n",
    "eval_dataset = split_dataset[\"test\"]\n",
    "print(f\"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(eval_dataset)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a5216910d0a339a",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load LoRA configuration\n",
    "peft_config = LoraConfig(\n",
    "    r=lora_r,\n",
    "    lora_alpha=lora_alpha,\n",
    "    lora_dropout=lora_dropout,\n",
    "    bias=\"none\",\n",
    "    task_type=\"CAUSAL_LM\",\n",
    "    target_modules=target_modules\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "230bfceb895c6738",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`LoraConfig` 对象用于在使用 Peft 库时配置 LoRA（低秩适配）设置。它可以帮助减少需要微调的参数数量，从而加速训练并减少内存使用。以下是各个参数的详细说明：\n",
    "\n",
    "- **`r`**：LoRA 中低秩矩阵的秩。该参数控制低秩适配的维度，直接影响模型适应能力和计算成本。\n",
    "- **`lora_alpha`**：该参数控制低秩适配矩阵的缩放因子。较高的 alpha 值可以提高模型学习新任务的能力。\n",
    "- **`lora_dropout`**：LoRA 的 dropout 比率。该参数有助于在微调过程中防止过拟合。在此案例中，设为 0.1。\n",
    "- **`bias`**：指定是否向低秩矩阵添加偏置项。在此案例中，设为 `\"none\"`，表示不添加偏置项。\n",
    "- **`task_type`**：定义模型微调的任务类型。在这里，`\"CAUSAL_LM\"` 表示任务是因果语言建模任务，即预测序列中的下一个单词。\n",
    "- **`target_modules`**：指定模型中将应用 LoRA 的模块。在此案例中，设为 `[\"q_proj\", \"v_proj\", 'k_proj']`，即模型注意力机制中的查询（query）、值（value）和键（key）投影层。\n",
    "\n",
    "通过调整这些参数，`LoraConfig` 有助于在微调过程中高效地应用 LoRA 方法，从而优化计算资源和训练效率。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32d8aa11a6d47e0d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load QLoRA configuration\n",
    "compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n",
    "\n",
    "bnb_config = BitsAndBytesConfig(\n",
    "    load_in_4bit=load_in_4bit,\n",
    "    bnb_4bit_quant_type=bnb_4bit_quant_type,\n",
    "    bnb_4bit_compute_dtype=compute_dtype,\n",
    "    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "535275d96f478839",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "这个代码块配置了使用 **BitsAndBytes (bnb)** 库的设置，该库提供了高效的内存管理和压缩技术，专为 PyTorch 模型设计。具体来说，它定义了如何加载和量化模型权重为 4 位精度，这对于减少内存使用并可能加速推理非常有用。\n",
    "\n",
    "- **`load_in_4bit`**：一个布尔值，决定是否以 4 位精度加载模型。\n",
    "- **`bnb_4bit_quant_type`**：指定使用哪种类型的 4 位量化。在这里，设置为 4 位 NormalFloat (NF4) 量化类型，这是 QLoRA 中引入的一种新数据类型。该类型对于正态分布权重在信息理论上最优，提供了一种高效的方式来对模型进行量化以进行微调。\n",
    "- **`bnb_4bit_compute_dtype`**：设置用于涉及量化模型计算的数据类型。在 QLoRA 中，设置为 `\"float16\"`，这是混合精度训练中常用的类型，旨在平衡性能和精度。\n",
    "- **`bnb_4bit_use_double_quant`**：一个布尔值，指示是否使用双重量化。设置为 `False` 表示只使用单次量化，通常速度更快，但可能精度稍低。\n",
    "\n",
    "### 为什么我们有两种数据类型（quant_type 和 compute_type）？\n",
    "\n",
    "QLoRA 使用了两种不同的数据类型：一种用于存储基础模型权重（在这里是 4 位 NormalFloat），另一种用于计算操作（16 位）。在前向和反向传播过程中，QLoRA 会将权重从存储格式解量化为计算格式。然而，它仅为 LoRA 参数计算梯度，这些参数使用的是 16 位 bfloat 数据类型。这样的做法确保了权重只有在必要时才会解压，从而在训练和推理过程中保持较低的内存使用量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bacbbc9ddd19504d",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load base model\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_name,\n",
    "    quantization_config=bnb_config,\n",
    "    device_map=device_map\n",
    ")\n",
    "model.config.use_cache = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a82c50bc69c3632b",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Set training parameters\n",
    "training_arguments = TrainingArguments(\n",
    "    output_dir=new_model,\n",
    "    num_train_epochs=num_train_epochs,\n",
    "    per_device_train_batch_size=per_device_train_batch_size,\n",
    "    gradient_accumulation_steps=gradient_accumulation_steps,\n",
    "    optim=optim,\n",
    "    save_steps=save_steps,\n",
    "    logging_steps=logging_steps,\n",
    "    learning_rate=learning_rate,\n",
    "    weight_decay=weight_decay,\n",
    "    fp16=fp16,\n",
    "    bf16=bf16,\n",
    "    max_steps=max_steps,\n",
    "    warmup_ratio=warmup_ratio,\n",
    "    gradient_checkpointing=gradient_checkpointing,\n",
    "    group_by_length=group_by_length,\n",
    "    lr_scheduler_type=lr_scheduler_type\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c86b66f59bee28dc",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "tokenizer.padding_side = \"right\"  # Fix weird overflow issue with fp16 training\n",
    "if not tokenizer.chat_template:\n",
    "    tokenizer.chat_template = \"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea4399c36bcdcbbd",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "关于聊天模板，我们简要说明一下，为了理解用户和模型在模型训练过程中对话的结构，创建了一系列预留短语来区分用户的消息和模型的回复。这确保了模型能够准确理解每条消息的来源，并保持对话结构的连贯性。通常，遵循聊天模板有助于提高任务的准确性。然而，当微调数据集与模型之间存在分布偏移时，使用特定的聊天模板可能更加有助于提升效果。\n",
    "\n",
    "欲了解更多信息，请访问 [Hugging Face 博客关于聊天模板](https://huggingface.co/blog/chat-templates)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d3f935e03db79b8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def special_formatting_prompts(example):\n",
    "    output_texts = []\n",
    "    for i in range(len(example['instruction'])):\n",
    "        text = f\"{instruction_prompt_template}{example['instruction'][i]}\\n{response_template} {example['output'][i]}\"\n",
    "        output_texts.append(text)\n",
    "    return output_texts\n",
    "\n",
    "\n",
    "def normal_formatting_prompts(example):\n",
    "    output_texts = []\n",
    "    for i in range(len(example['instruction'])):\n",
    "        chat_temp = [{\"role\": \"system\", \"content\": example['instruction'][i]},\n",
    "                     {\"role\": \"assistant\", \"content\": example['output'][i]}]\n",
    "        text = tokenizer.apply_chat_template(chat_temp, tokenize=False)\n",
    "        output_texts.append(text)\n",
    "    return output_texts\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95dc3db0d6c5ddaf",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "if use_special_template:\n",
    "    formatting_func = special_formatting_prompts\n",
    "    if use_llama_like_model:\n",
    "        response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]\n",
    "        collator = DataCollatorForCompletionOnlyLM(response_template=response_template_ids, tokenizer=tokenizer)\n",
    "    else:\n",
    "        collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)\n",
    "else:\n",
    "    formatting_func = normal_formatting_prompts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48e09edab86c4212",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "trainer = SFTTrainer(\n",
    "    model=model,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=eval_dataset,\n",
    "    peft_config=peft_config,\n",
    "    formatting_func=formatting_func,\n",
    "    data_collator=collator,\n",
    "    max_seq_length=max_seq_length,\n",
    "    processing_class=tokenizer,\n",
    "    args=training_arguments,\n",
    "    packing=packing\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38fb6fddbca5567e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`SFTTrainer` 随后被实例化，用于处理模型的监督微调（SFT）。这个训练器专门针对 SFT 进行设计，并包括了一些额外的参数，例如 `formatting_func` 和 `packing`，这些参数通常在标准的训练器类中是找不到的。\n",
    "\n",
    "**`formatting_func`**：一个自定义函数，用于格式化训练样本，将指令和回复模板组合在一起。\n",
    "**`packing`**：禁用将多个样本打包成一个序列，这是标准 `Trainer` 类中没有的参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a17a3b28010ce90e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Train model\n",
    "trainer.train()\n",
    "\n",
    "# Save fine tuned Lora Adaptor \n",
    "trainer.model.save_pretrained(new_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39abd4f63776cc49",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 推理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70cca01bc96d9ead",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import gc\n",
    "\n",
    "\n",
    "def clear_hardwares():\n",
    "    torch.clear_autocast_cache()\n",
    "    torch.cuda.ipc_collect()\n",
    "    torch.cuda.empty_cache()\n",
    "    gc.collect()\n",
    "\n",
    "\n",
    "clear_hardwares()\n",
    "clear_hardwares()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd8313238b26e95e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def generate(model, prompt: str, kwargs):\n",
    "    tokenized_prompt = tokenizer(prompt, return_tensors='pt').to(model.device)\n",
    "\n",
    "    prompt_length = len(tokenized_prompt.get('input_ids')[0])\n",
    "\n",
    "    with torch.cuda.amp.autocast():\n",
    "        output_tokens = model.generate(**tokenized_prompt, **kwargs) if kwargs else model.generate(**tokenized_prompt)\n",
    "        output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)\n",
    "\n",
    "    return output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3fe5a27fa40ba9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "base_model = AutoModelForCausalLM.from_pretrained(new_model, return_dict=True, device_map='auto', token='')\n",
    "tokenizer = AutoTokenizer.from_pretrained(new_model, max_length=max_seq_length)\n",
    "model = PeftModel.from_pretrained(base_model, new_model)\n",
    "del base_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70682a07fcaaca3f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sample = eval_dataset[0]\n",
    "if use_special_template:\n",
    "    prompt = f\"{instruction_prompt_template}{sample['instruction']}\\n{response_template}\"\n",
    "else:\n",
    "    chat_temp = [{\"role\": \"system\", \"content\": sample['instruction']}]\n",
    "    prompt = tokenizer.apply_chat_template(chat_temp, tokenize=False, add_generation_prompt=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "febeb00f0a6f0b5e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gen_kwargs = {\"max_new_tokens\": 1024}\n",
    "generated_texts = generate(model=model, prompt=prompt, kwargs=gen_kwargs)\n",
    "print(generated_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c18abf489437a546",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 合并基础模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f5f450001bf428f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clear_hardwares()\n",
    "merged_model = model.merge_and_unload()\n",
    "clear_hardwares()\n",
    "del model\n",
    "adapter_model_name = 'your_hf_account/your_desired_name'\n",
    "merged_model.push_to_hub(adapter_model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16775c2ed49bfe11",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "在这里，我们将适配器与基础模型合并，并将合并后的模型推送到模型库中。你也可以只推送适配器到模型库，而避免推送沉重的基础模型文件，方法如下：\n",
    "\n",
    "```python\n",
    "model.push_to_hub(adapter_model_name)\n",
    "```\n",
    "\n",
    "然后，你可以按照以下方式加载模型：\n",
    "\n",
    "```python\n",
    "config = PeftConfig.from_pretrained(adapter_model_name)\n",
    "model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')\n",
    "tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)\n",
    "\n",
    "# 加载 Lora 模型\n",
    "model = PeftModel.from_pretrained(model, adapter_model_name)\n",
    "```\n",
    "\n",
    "通过这种方法，你能够高效地加载适配器模型，而不需要加载完整的基础模型，从而减少内存消耗并提高推理速度。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4851ef41e4cc4f95",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 用 [Vllm](https://github.com/vllm-project/vllm) 快速推理\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe82f0a57fe86f60",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "`vllm` 库是目前用于大型语言模型（LLM）推理的最快引擎之一。有关可用选项的比较概览，你可以参考这篇博客：[7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88)。\n",
    "\n",
    "在这个示例中，我们对该任务使用了我们微调模型的第一个版本进行推理。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88bee8960b176e87",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from vllm import LLM, SamplingParams\n",
    "\n",
    "prompt = \"\"\"### Question: here is a product title from a Iranian marketplace.  \\n         give me the Product Entity and Attributes of this product in Persian language.\\n         give the output in this json format: {'attributes': {'attribute_name' : <attribute value>, ...}, 'product_entity': '<product entity>'}.\\n         Don't make assumptions about what values to plug into json. Just give Json not a single word more.\\n         \\nproduct title:\"\"\"\n",
    "user_prompt_template = '### Question: '\n",
    "response_template = ' ### Answer:'\n",
    "\n",
    "llm = LLM(model='BaSalam/Llama2-7b-entity-attr-v1', gpu_memory_utilization=0.9, trust_remote_code=True)\n",
    "\n",
    "product = 'مانتو اسپرت پانیذ قد جلوی کار حدودا 85 سانتی متر قد پشت کار حدودا 88 سانتی متر'\n",
    "sampling_params = SamplingParams(temperature=0.0, max_tokens=75)\n",
    "prompt = f'{user_prompt_template} {prompt}{product}\\n {response_template}'\n",
    "outputs = llm.generate(prompt, sampling_params)\n",
    "\n",
    "print(outputs[0].outputs[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc007ced7ca34bbb",
   "metadata": {},
   "source": [
    "### 样例输出\n",
    "\n",
    "```\n",
    "{\n",
    "    \"attributes\": {\n",
    "        \"قد جلوی کار\": \"85 سانتی متر\",\n",
    "        \"قد پشت کار\": \"88 سانتی متر\"\n",
    "    },\n",
    "    \"product_entity\": \"مانتو اسپرت\"\n",
    "}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bfe00769699bbd2",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "在这篇博客中，你可以阅读关于微调大型语言模型（LLMs）的最佳实践：[Sebastian Raschka 的杂志](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web)。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
